Homework 1

Required libraries
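No library() calls were echoed in this section; based on the functions used throughout (read_csv, get_acs, modelsummary, geom_text_repel, datagrid), the code likely needs at least the following packages. Treat this list as an inference, not the original setup chunk:

```r
# Packages inferred from the functions called in this document;
# install any that are missing with install.packages().
library(tidyverse)       # read_csv, dplyr verbs, pivot_wider, ggplot2
library(tidycensus)      # get_acs (requires a Census API key)
library(modelsummary)    # formatted regression tables
library(ggrepel)         # geom_text_repel
library(marginaleffects) # datagrid, plot_predictions, avg_comparisons
```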

Use the FiveThirtyEight presidential elections data to answer the following questions about the 2020 general election results.

url <- "https://raw.githubusercontent.com/fivethirtyeight/election-results/main/election_results_presidential.csv"
presidential_elections <- read_csv(url)

voteshares <- presidential_elections |>
  filter(office_name == "U.S. President" & cycle == 2020 & stage == "general") |>
  filter(candidate_name %in% c("Joe Biden", "Donald Trump")) |>
  filter(!str_detect(state, " CD-\\d+")) |> # drop the ME/NE congressional-district rows
  group_by(state, candidate_name) |>
  summarise(votes = sum(votes, na.rm = TRUE)) |>
  pivot_wider(names_from = candidate_name, values_from = votes) |>
  mutate(`Biden Share` = `Joe Biden` / (`Joe Biden` + `Donald Trump`))

Question 1

Use the following code to download demographics for each state from the American Community Survey and then use a join to add this column to your data.

Answer

acs_data <- get_acs(geography = "state",
                    variables = c(median_income = "B19013_001",
                                  bachelors_degree = "B15003_022",
                                  pop = "B01001_001",
                                  white_pop = "B02001_002"),
                    year = 2020)

voteshares <- acs_data |>
  select(-moe) |>
  pivot_wider(names_from = variable, values_from = estimate) |>
  right_join(voteshares, by = join_by(NAME == state))

(Note: you can ignore the "moe" column and just use the estimates. You'll need to use pivot_wider to reshape these into wide format with one row per state.)

Question 2

Run a linear regression to calculate the effect of median income on Biden’s statewide two party vote share. Produce a formatted table to display your results and briefly discuss your findings.

Answer

The basic code here is straightforward, but there are some things you can do to improve the formatting:

  1. One problem that comes up a lot is that the coefficient values on things like median income will be so tiny that they show up as 0.00 in your model output, even though they actually have a substantial effect on your dependent variable. Where it makes sense to do so, you can measure values like this in larger increments, like $1,000 or even $10,000. This won’t impact your actual results. You can do the same thing for the dependent variable to convert a proportion to a percentage.
  2. You should generally rename covariates like median_income to something more descriptive. R restricts what you can name variables in code, but those restrictions don't apply to the labels you display in your model output.
  3. If applicable, add a note to indicate that your output shows standard errors in parentheses.
  4. It usually makes sense to round your statistics to a consistent number of decimal places.
  5. In some cases you might also consider mean-centering the independent variables. This will give you an interpretable intercept term.
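To illustrate tip 5, here is a minimal sketch of mean-centering, assuming the voteshares data assembled in Question 1 (the income10k_c name is just for illustration):

```r
# Mean-center median income so the intercept is the predicted Biden percentage
# for a state with *average* median income, rather than an income of zero.
voteshares_c <- voteshares |>
  mutate(income10k_c = median_income / 10000 -
           mean(median_income / 10000, na.rm = TRUE),
         biden_percent = `Biden Share` * 100)
model_centered <- lm(biden_percent ~ income10k_c, data = voteshares_c)
# The slope is unchanged; only the intercept's interpretation shifts.
```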
# Q2
library(modelsummary)
voteshares <- voteshares |>
  mutate(income10k = median_income / 10000,
         biden_percent = `Biden Share` * 100)
model1 <- lm(biden_percent ~ income10k, data = voteshares)
modelsummary(list("DV: % Biden vote in 2020 election" = model1),
             coef_rename = c(income10k = "Median income ($10,000)"),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error = 2), # round estimates and SEs
             notes = "Std. err in parentheses")
% statewide Biden vote in 2020 by median income

                          DV: % Biden vote in 2020 election
(Intercept)                        -2.20
                                   (7.33)
Median income ($10,000)             7.98
                                   (1.11)
Num.Obs.                           51
R2                                 0.513
R2 Adj.                            0.503
AIC                                369.1
BIC                                374.9
Log.Lik.                           -181.562
F                                  51.599
RMSE                               8.51

Std. err in parentheses

One way to enhance your analysis is by plotting your regression predictions. This is especially helpful in cases like this one where you have a relatively small number of observations:

library(ggrepel)

# Build a lookup table of state names and abbreviations
# (DC isn't in R's built-in state.name vector, so add it manually)
dc <- data.frame(state = "District of Columbia", abb = "DC")
states_table <- data.frame(state = state.name, abb = state.abb) |>
  bind_rows(dc)

voteshares |>
  left_join(states_table, by = join_by(NAME == state)) |>
  ggplot(aes(x = income10k,
             y = biden_percent,
             label = abb)) +
  geom_hline(yintercept = 50, color = 'lightgrey', lty = 2) +
  geom_point(aes(color = biden_percent > 50)) +
  geom_smooth(method = 'lm', color = 'darkgrey', lty = 2, se = FALSE) +
  geom_text_repel(aes(color = biden_percent > 50)) +
  theme_bw() +
  xlab("Median income ($10,000)") +
  ylab("Biden % of two-party vote") +
  scale_color_manual(guide = 'none', values = c('#DC143C', 'blue')) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Another way to enhance your analysis is to talk about how your model's predicted results might change under different scenarios. For instance, Texas is about 1.27 points short of a predicted 50%, and the model implies roughly 0.798 points per $1,000 of median income, so it would take about a $1,600 increase in Texas's median income (1.27 / .798 ≈ 1.6) to move it from a predicted loss to a predicted win:

texas_pred <- predict(model1, newdata = voteshares |> filter(NAME == "Texas"))
needed <- 50 - texas_pred
coef <- 7.98 # rounded income10k coefficient from model1
10000 * (needed / coef)
      1 
1596.53 

Question 3

Run the previous model, but include Bachelor’s degrees and population. Put your results in a formatted table and assess your results.

Answer

Again, the code here is relatively straightforward, and the advice above all applies here.

When you have multiple models, it's often helpful to present them side by side so readers can compare results:

voteshares <- voteshares |>
  mutate(bachelors_pct = (bachelors_degree / pop) * 100,
         population100k = pop / 100000)

model2 <- lm(biden_percent ~ income10k + bachelors_pct + population100k, data = voteshares)
mlist <- list("Model 1" = model1, "Model 2" = model2)

modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)", 
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)"
                             ),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error=2), 
             notes = "Std. err in parentheses"
             )
% statewide Biden vote in 2020 by median income

                                Model 1     Model 2
(Intercept)                      -2.20      -10.11
                                 (7.33)      (7.64)
Median income ($10,000)           7.98        4.44
                                 (1.11)      (1.76)
% Bachelor's degree or higher                 2.18
                                             (0.90)
Population (100,000)                          0.02
                                             (0.02)
Num.Obs.                         51          51
R2                               0.513       0.577
R2 Adj.                          0.503       0.550
AIC                              369.1       365.9
BIC                              374.9       375.5
Log.Lik.                         -181.562    -177.939
F                                51.599      21.407
RMSE                             8.51        7.93

Std. err in parentheses

Question 4

Check your model for heteroskedasticity, non-linearity, or influential outliers. Discuss your interpretation.

Answer

(You may need to adjust the figure width and height to get readable plots here)

There aren’t any major red flags here. There’s no significant heteroskedasticity, the residuals are mostly normal, and there are no signs of significant collinearity or outliers.
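The diagnostic messages below match the style of the performance package's model checks. The original code wasn't shown, so treat this as one possible way to produce output like it:

```r
# Diagnostic checks on the multivariate model using the performance package
library(performance)
check_heteroscedasticity(model2) # test of non-constant error variance
check_normality(model2)          # test of residual normality
check_outliers(model2)           # influential observations via Cook's distance
```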

OK: Error variance appears to be homoscedastic (p = 0.225).
OK: residuals appear as normally distributed (p = 0.099).
OK: No outliers detected.
- Based on the following method and threshold: cook (0.9).
- For variable: (Whole model)

While it's not strictly necessary here, we can easily get heteroskedasticity-robust standard errors by using the vcov argument in modelsummary.

modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)", 
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)"
                             ),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error=2), 
             notes = "Robust standard errors in parentheses",
             vcov = "HC3" # robust to heteroskedasticity, small sample bias, and leverage
             )
% statewide Biden vote in 2020 by median income

                                Model 1     Model 2
(Intercept)                      -2.20      -10.11
                                 (8.89)     (10.38)
Median income ($10,000)           7.98        4.44
                                 (1.42)      (2.14)
% Bachelor's degree or higher                 2.18
                                             (1.06)
Population (100,000)                          0.02
                                             (0.01)
Num.Obs.                         51          51
R2                               0.513       0.577
R2 Adj.                          0.503       0.550
AIC                              369.1       365.9
BIC                              374.9       375.5
Log.Lik.                         -181.562    -177.939
F                                31.531      16.139
RMSE                             8.51        7.93
Std.Errors                       HC3         HC3

Robust standard errors in parentheses

Or we can get bootstrapped standard errors:

set.seed(1000)
modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)", 
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)"
                             ),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error=2), 
             notes = "Bootstrapped standard errors in parentheses",
             vcov = "bootstrap" 
             )
% statewide Biden vote in 2020 by median income

                                Model 1     Model 2
(Intercept)                      -2.20      -10.11
                                 (8.02)     (10.12)
Median income ($10,000)           7.98        4.44
                                 (1.25)      (1.77)
% Bachelor's degree or higher                 2.18
                                             (0.94)
Population (100,000)                          0.02
                                             (0.02)
Num.Obs.                         51          51
R2                               0.513       0.577
R2 Adj.                          0.503       0.550
AIC                              369.1       365.9
BIC                              374.9       375.5
Log.Lik.                         -181.562    -177.939
RMSE                             8.51        7.93
Std.Errors                       Bootstrap   Bootstrap

Bootstrapped standard errors in parentheses

Extra code

Logging an IV

It's probably not necessary in this case, but sometimes we might want to change variables like population to use a log scale. This reduces the influence of outliers and can reflect a potential non-linear relationship between an IV and a DV. Note that the log scaling changes our interpretation of the results. When an independent variable has been log-transformed, you can interpret the coefficient as "a 1% increase in [the IV] is associated with a [coefficient]/100 unit increase in the DV". So in model 3, a 1% increase in population is associated with a roughly 0.01 percentage point increase (1.23 / 100) in Biden's vote share in a state.

model3 <- lm(biden_percent ~ income10k + bachelors_pct + log(population100k), data = voteshares)
mlist <- list("Model 1" = model1, "Model 2" = model2, "Model 3" = model3)

modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)", 
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)",
                             `log(population100k)` = "Log population (100,000)"
                             ),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error=2), 
             notes = "Std. err in parentheses"
             )
% statewide Biden vote in 2020 by median income

                                Model 1     Model 2     Model 3
(Intercept)                      -2.20      -10.11      -14.25
                                 (7.33)      (7.64)      (8.74)
Median income ($10,000)           7.98        4.44        4.63
                                 (1.11)      (1.76)      (1.76)
% Bachelor's degree or higher                 2.18        2.15
                                             (0.90)      (0.90)
Population (100,000)                          0.02
                                             (0.02)
Log population (100,000)                                  1.23
                                                         (1.12)
Num.Obs.                         51          51          51
R2                               0.513       0.577       0.575
R2 Adj.                          0.503       0.550       0.548
AIC                              369.1       365.9       366.2
BIC                              374.9       375.5       375.8
Log.Lik.                         -181.562    -177.939    -178.077
F                                51.599      21.407      21.208
RMSE                             8.51        7.93        7.95

Std. err in parentheses

Another way to interpret the effect of a logged IV is through examples and predictions. For instance, we might look at predicted outcomes for all states at different values of population:

library(marginaleffects)

nd <- datagrid(newdata = voteshares,
               grid_type = 'counterfactual',
               population100k = seq(5, 400, by = 1))

plot_predictions(model3,
                 points = .5, # show actual data points
                 newdata = nd,
                 by = 'population100k') +
  theme_bw() +
  labs(x = "Population 100K", y = '% Biden vote')

We can also look at the average effect of increasing population by 1 unit on the predicted outcome:

avg_comparisons(model3, variables = 'population100k')

 Estimate Std. Error   z Pr(>|z|)   S   2.5 % 97.5 %
   0.0507     0.0462 1.1    0.272 1.9 -0.0398  0.141

Term: population100k
Type: response
Comparison: +1

Or we can do something like look at the effect of moving population from its minimum to its maximum:

avg_comparisons(model3, variables = list('population100k' = 'minmax'))

 Estimate Std. Error   z Pr(>|z|)   S 2.5 % 97.5 %
      5.2       4.73 1.1    0.272 1.9 -4.08   14.5

Term: population100k
Type: response
Comparison: Max - Min
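As a sanity check, the Max - Min comparison can also be computed by hand: with a logged IV in a linear model, the predicted change is just the coefficient times log(max/min). A sketch, assuming the model3 and voteshares objects defined above:

```r
# Manual version of the Max - Min comparison: the difference in predictions
# when every state's population is set to the maximum vs. the minimum
# observed value, averaged over states.
p_max <- predict(model3, newdata = voteshares |>
                   mutate(population100k = max(population100k, na.rm = TRUE)))
p_min <- predict(model3, newdata = voteshares |>
                   mutate(population100k = min(population100k, na.rm = TRUE)))
mean(p_max - p_min) # equals coef * log(max/min); should match the ~5.2 estimate above
```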